
where RBConv denotes the convolution operation implemented as a new module, $F^l_{in}$ and $F^l_{out}$ are the feature maps before and after the convolution, respectively, $W^l$ are the full-precision filters, the values of $\hat{W}^l$ are $+1$ or $-1$, and $\odot$ is the element-wise product operation.
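To make the module concrete, here is a minimal PyTorch-style sketch of such a rectified binary convolution. The class name `RBConv`, the shape chosen for $C^l$ (one learnable value per kernel position, shared across a layer's filters), and the hard sign-style binarization are illustrative assumptions, not the reference implementation; training additionally needs the gradient approximation of Eq. (3.68) below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RBConv(nn.Module):
    """Sketch: binarize the full-precision filters W^l to +1/-1 and rescale
    them element-wise with a learnable matrix C^l before the convolution."""

    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        # Full-precision filters W^l (kept and updated during training).
        self.W = nn.Parameter(0.01 * torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        # Learnable rectification matrix C^l (assumed shape: kernel_size x kernel_size).
        self.C = nn.Parameter(torch.ones(kernel_size, kernel_size))
        self.stride, self.padding = stride, padding

    def forward(self, x):
        # \hat{W}^l: binarized filters whose values are +1 or -1.
        w_hat = torch.where(self.W >= 0, torch.ones_like(self.W), -torch.ones_like(self.W))
        # Element-wise product of C^l and \hat{W}^l, then an ordinary convolution.
        return F.conv2d(x, self.C * w_hat, stride=self.stride, padding=self.padding)
```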

During the backward propagation of RBCNs, both the full-precision filters $W$ and the learnable matrices $C$ need to be learned and updated. These two sets of parameters are learned jointly: in each convolutional layer, we update $W$ first and then $C$.
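As a rough, framework-agnostic sketch of this update order, one iteration for a single layer could be structured as follows; `compute_delta_W` and `compute_delta_C` are placeholders for the gradient computations derived in Eqs. (3.67)–(3.76) below.

```python
def train_layer_step(layer, compute_delta_W, compute_delta_C, eta1, eta2):
    """One iteration for one convolutional layer: update W^l first, then C^l."""
    # Step 1: gradient step on the full-precision filters W^l.
    layer.W = layer.W - eta1 * compute_delta_W(layer)
    # Step 2: with W^l now fixed, gradient step on the learnable matrix C^l.
    layer.C = layer.C - eta2 * compute_delta_C(layer)
    return layer
```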

Update W: Let $\delta_{W^l_i}$ be the gradient of the full-precision filter $W^l_i$. During backpropagation, the gradients are first passed to $\hat{W}^l_i$ and then to $W^l_i$. Thus,
\[
\delta_{W^l_i} = \frac{\partial L}{\partial W^l_i} = \frac{\partial L}{\partial \hat{W}^l_i} \, \frac{\partial \hat{W}^l_i}{\partial W^l_i}, \tag{3.67}
\]

where
\[
\frac{\partial \hat{W}^l_i}{\partial W^l_i} =
\begin{cases}
2 + 2W^l_i, & -1 \le W^l_i < 0, \\
2 - 2W^l_i, & 0 \le W^l_i < 1, \\
0, & \text{otherwise},
\end{cases} \tag{3.68}
\]
which is an approximation of $2\times$ the Dirac delta function [159].
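A direct NumPy transcription of this piecewise derivative might look as follows; the function name `approx_sign_grad` is an assumption used for illustration.

```python
import numpy as np


def approx_sign_grad(W):
    """Piecewise approximation of d(W_hat)/dW from Eq. (3.68): 2 + 2W on
    [-1, 0), 2 - 2W on [0, 1), and 0 elsewhere. Its integral over [-1, 1]
    is 2, i.e., it approximates 2x the Dirac delta function."""
    grad = np.zeros_like(W, dtype=float)
    neg = (W >= -1) & (W < 0)
    pos = (W >= 0) & (W < 1)
    grad[neg] = 2 + 2 * W[neg]
    grad[pos] = 2 - 2 * W[pos]
    return grad
```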

Furthermore,
\[
\frac{\partial L}{\partial \hat{W}^l_i} = \frac{\partial L_S}{\partial \hat{W}^l_i} + \frac{\partial L_{Kernel}}{\partial \hat{W}^l_i} + \frac{\partial L_{Adv}}{\partial \hat{W}^l_i}, \tag{3.69}
\]

and
\[
W^l_i \leftarrow W^l_i - \eta_1 \delta_{W^l_i}, \tag{3.70}
\]
where $\eta_1$ is the learning rate. Then,

\[
\frac{\partial L_{Kernel}}{\partial \hat{W}^l_i} = -\lambda_1 \left( W^l_i - C^l \hat{W}^l_i \right) C^l, \tag{3.71}
\]
\[
\frac{\partial L_{Adv}}{\partial \hat{W}^l_i} = -2 \left( 1 - D(T^l_i; Y) \right) \frac{\partial D}{\partial \hat{W}^l_i}. \tag{3.72}
\]
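Putting Eqs. (3.67)–(3.72) together, a possible NumPy sketch of the W-update for one filter is given below. It reuses `approx_sign_grad` from the earlier sketch; `grad_LS_wrt_What`, `D_value`, and `grad_D_wrt_What` stand in for the quantities supplied by the task loss $L_S$ and the discriminator $D$, and are assumptions for illustration.

```python
def update_W(W, W_hat, C, grad_LS_wrt_What, D_value, grad_D_wrt_What,
             lambda1, eta1):
    """One gradient step on a full-precision filter W^l_i (Eqs. (3.67)-(3.72)).

    grad_LS_wrt_What : dL_S / dW_hat, from the task loss
    D_value          : D(T^l_i; Y), the discriminator's output
    grad_D_wrt_What  : dD / dW_hat
    """
    # Eq. (3.71): kernel (reconstruction) term.
    grad_kernel = -lambda1 * (W - C * W_hat) * C
    # Eq. (3.72): adversarial term.
    grad_adv = -2.0 * (1.0 - D_value) * grad_D_wrt_What
    # Eq. (3.69): total gradient with respect to the binarized filter.
    grad_What = grad_LS_wrt_What + grad_kernel + grad_adv
    # Eq. (3.67): chain rule through the binarization, using Eq. (3.68).
    delta_W = grad_What * approx_sign_grad(W)
    # Eq. (3.70): SGD step with learning rate eta1.
    return W - eta1 * delta_W
```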

Update C: We further update the learnable matrix $C^l$ with $W^l$ fixed. Let $\delta_{C^l}$ be the gradient of $C^l$. Then we have
\[
\delta_{C^l} = \frac{\partial L_S}{\partial C^l} + \frac{\partial L_{Kernel}}{\partial C^l} + \frac{\partial L_{Adv}}{\partial C^l}, \tag{3.73}
\]

and
\[
C^l \leftarrow C^l - \eta_2 \delta_{C^l}, \tag{3.74}
\]
where $\eta_2$ is another learning rate. Furthermore,

\[
\frac{\partial L_{Kernel}}{\partial C^l} = -\lambda_1 \sum_i \left( W^l_i - C^l \hat{W}^l_i \right) \hat{W}^l_i, \tag{3.75}
\]
\[
\frac{\partial L_{Adv}}{\partial C^l} = -\sum_i 2 \left( 1 - D(T^l_i; Y) \right) \frac{\partial D}{\partial C^l}. \tag{3.76}
\]
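Analogously, a NumPy sketch of the C-update is shown below; the argument names (per-filter lists and the discriminator-related terms) are assumptions for illustration.

```python
def update_C(C, W_list, W_hat_list, grad_LS_wrt_C, D_values, grad_D_wrt_C_list,
             lambda1, eta2):
    """One gradient step on the shared matrix C^l (Eqs. (3.73)-(3.76)).

    W_list, W_hat_list : full-precision and binarized filters W^l_i, W_hat^l_i
    D_values           : discriminator outputs D(T^l_i; Y), one per filter i
    grad_D_wrt_C_list  : dD / dC^l, one term per filter i
    """
    # Eq. (3.75): kernel term, summed over the filters i of layer l.
    grad_kernel = -lambda1 * sum((W - C * W_hat) * W_hat
                                 for W, W_hat in zip(W_list, W_hat_list))
    # Eq. (3.76): adversarial term, summed over i.
    grad_adv = -sum(2.0 * (1.0 - d) * g
                    for d, g in zip(D_values, grad_D_wrt_C_list))
    # Eq. (3.73): total gradient, followed by the SGD step of Eq. (3.74).
    delta_C = grad_LS_wrt_C + grad_kernel + grad_adv
    return C - eta2 * delta_C
```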

These derivations show that the rectified process is trainable in an end-to-end manner. The complete training procedure, including how the discriminators are updated, is summarized in Algorithm 13. As described in line 17 of Algorithm 13, we update the other parameters independently while keeping the convolutional layers' parameters fixed, which enhances the variety of each layer's feature maps. In this way, we speed up training convergence and fully explore the potential of 1-bit networks. In our implementation, all values of $C^l$ are replaced by their average during the forward process, so only a scalar, rather than a matrix, is involved at inference time, which speeds up computation.
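To make the last point concrete, a minimal sketch of this inference-time simplification, assuming a PyTorch-style module like the `RBConv` sketch above, could be:

```python
import torch


@torch.no_grad()
def fold_C_for_inference(layer):
    """Replace the learnable matrix C^l by its scalar average, so inference
    only scales the binary filters by a single number instead of a matrix."""
    c_mean = layer.C.mean()  # scalar average of C^l
    w_hat = torch.where(layer.W >= 0, torch.ones_like(layer.W), -torch.ones_like(layer.W))
    layer.inference_weight = c_mean * w_hat  # effective 1-bit weights for deployment
    return layer
```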